Abstract

We study how to efficiently diffuse updates to a large
distributed system of data replicas, some of which may exhibit
arbitrary (Byzantine) failures. We assume that strictly fewer 
than t replicas fail, and that each update is initially received 
by at least t correct replicas. The goal is to diffuse each 
update to all correct replicas while ensuring that correct 
replicas accept no updates generated spuriously by faulty 
replicas. To achieve reliable diffusion, each correct replica 
accepts an update only after receiving it from at least t others.
We provide the first analysis of epidemic-style protocols
for such environments. This analysis is fundamentally dif- 
ferent from known analyses for the benign case due to our 
treatment of fully Byzantine failures — which, among other 
things, precludes the use of digital signatures for authenti- 
cating forwarded updates. We propose two epidemic-style 
diffusion algorithms and two measures that characterize the 
efficiency of diffusion algorithms in general. We character- 
ize both of our algorithms according to these measures, and 
also prove lower bounds with regard to these measures that
show that our algorithms are close to optimal. 

1 Introduction 



A diffusion protocol is the means by which an update 
initially known to a portion of a distributed system is prop- 
agated to the rest of the system. Diffusion is useful for driv- 
ing replicated data toward a consistent state over time, and 
has found application for this purpose, e.g., in USENET
News [LOM94] and in the Grapevine [BLNS82] and Clearinghouse
[OD81] systems. The quality of a diffusion protocol
is typically defined by the delay until the update has
reached all replicas, and the amount of message traffic that 
the protocol generates.

In this paper, we provide the first study of update diffusion
in distributed systems where components can suffer
Byzantine failures. The framework for our study is a
network of data replicas, of which strictly fewer than some
threshold t can fail arbitrarily, and to which updates are in- 
troduced continually over time. For example, these updates 
may be sensor readings of some data source that is sampled 
by replicas, or data that the source actively pushes to repli- 
cas. However, each update is initially received only by a 
subset of the correct replicas of some size $\alpha \ge t$, and so
replicas engage in a diffusion protocol to propagate updates 
to all correct replicas over time. Byzantine failures impact 
our study in that a replica that does not obtain the update 
directly from the source must receive copies of the update 
from at least t different replicas before it "accepts" the up- 
date as one actually generated by the source (as opposed to 
one generated spuriously by a faulty replica). 

In our study, we allow fully Byzantine failures, and thus 
cannot rely on digital signatures to authenticate the origi- 
nal source of a message that one replica forwards to oth- 
ers. While maximizing the fault models to which our up- 
per bounds apply, avoiding digital signatures also strength- 
ens our results in other respects. First, in a network that is 
believed to intrinsically provide the correct sender address 
for each message due to the presumed difficulty of forging 
that address, avoiding digital signatures avoids the adminis- 
trative overheads associated with distributing cryptographic 
keys. Second, even when the sender of a message is not 
reliably provided by the network, the sender can be authen- 
ticated using techniques that require no cryptographic as- 
sumptions (for a survey of these techniques, see [Sim92]). 
Employing digital signatures, on the other hand, would 
require assumptions limiting the computational power of 
faulty replicas. Third, pairwise authentication typically in- 
curs a low computation overhead on replicas, whereas dig- 
itally signing each message would impose a significantly 
higher overhead. 

To achieve efficient diffusion in our framework, we sug- 
gest two round-based algorithms: "Random", which is an 
epidemic-style protocol in which each replica sends messages
to randomly chosen replicas in each round, and
"$\ell$-Tree-Random", which diffuses updates along a tree structure.
For these algorithms, two measures of quality are studied:
The first one, delay, is the expected number of rounds
until any individual update is accepted by all correct repli- 
cas in the system. The delay measure expresses the speed 
of propagation. The second, fan-in, is the expected max- 
imum number of messages received by any replica in any 
round from correct replicas. Fan-in is a measure of the load 
inflicted on individual replicas in the common case, and 
hence, of any potential bottlenecks in execution. We eval- 
uate these measures for each of the protocols we present. 
In addition to these results, we prove a lower bound of 
$\Omega(\frac{t}{F^{out}} \log \frac{n}{\alpha})$ on the delay of any diffusion protocol, where
$F^{out}$ is the "fan-out" of the protocol, i.e., a bound on the
number of messages sent by any correct process in any
round. We also show an inherent tradeoff between good
(low) latency and good (low) fan-in, namely that their product
is at least $\Omega(tn/\alpha)$. Using this tradeoff, we demonstrate
that our protocols cover much of the spectrum of optimal- 
delay protocols for their respective fan-in to within logarith- 
mic factors. 

We emphasize that our treatment of full Byzantine fail- 
ures renders our problem fundamentally different from the 
case of crash failures only. Intuitively, any diffusion process 
has two phases: In the first phase, the initially active repli- 
cas for an update send this update, while the other replicas 
remain inactive. This phase continues while inactive repli- 
cas have fewer than t messages. In the second phase, new 
replicas become active and propagate updates themselves, 
resulting in an exponential growth of the set of active replicas.
In Figure 1 we depict the progress of epidemic diffusion.
The figure shows the number of active replicas plotted
against round number, for a system of $n = 100$ replicas
with different values of $t$, where $\alpha = t + 1$. The case
$t = 1$ is indistinguishable from diffusion with benign failures
only, since a single update received by a replica immediately
turns it into an active one. Thus, in this case,
the first phase is degenerate, and the exponential-growth
phase occurs from the start. Previous work has analyzed
the diffusion process in that case, proving propagation delay
[DGH+87] that is logarithmic in the number of replicas.
However, in the case that we consider here, i.e., $t \ge 2$, the
delay is dominated by the initial phase.

The rest of the paper is organized as follows. In Section 1.1
we illustrate specific applications for which Byzantine
message diffusion is suitable, and which motivated our
study. We discuss related work in Section 1.2. In Section 2
we lay out assumptions and notation used throughout the
paper, and in Section 3 we define our measures of diffusion
performance. In Section 4 we provide general theorems regarding
the delay and fan-in of diffusion protocols. In Section 5
we introduce our first diffusion protocol, Random,
and analyze its properties, and in Section 6 we describe the
$\ell$-Tree-Random protocol and its properties. We then summarize
and discuss our results, provide simulation
results that demonstrate the likely behavior of our
protocols in practice, and conclude.

Figure 1. Delay of random propagation with $n = 100$, $\alpha = t + 1$

1.1 Motivation 

The motivating application of our work on message dif- 
fusion is a data replication system called Fleet. (Fleet is not 
yet documented, but is based on similar design principles as 
a predecessor system called Phalanx [MR98b].) Fleet replicates
data so that it will survive even the malicious corruption
of some data replicas, and does so using adaptations of
quorum systems to such environments [MR98a]. A characteristic
of these replication techniques that is important for
this discussion is that each update is sent to only a relatively 
small subset (quorum) of servers, but one that is guaranteed 
to include t correct ones, where the number of faulty repli- 
cas is assumed to be less than t. Thus, after an update, most 
correct replicas have not actually received this update, and 
indeed any given correct replica can be arbitrarily out-of- 
date. 

While this local inconsistency does not impact the global 
consistency properties of the data when the network is con- 
nected (due to the properties of the quorum systems we em- 
ploy), it does make the system more sensitive to network 
partitions. That is, when the network partitions — and thus 
either global data consistency or progress of data opera- 
tions must be sacrificed — the application may dictate that 
data operations continue locally even at the risk of using 
stale data. To limit how stale local data is when the net- 
work partitions, we use a diffusion protocol while the net- 
work is connected to propagate updates to all replicas, in 
the background and without imposing additional overhead
on the critical path of data operations. In this way, the system
can still efficiently guarantee strict consistency in case
a full quorum is accessed, but can additionally provide re- 
laxed consistency guarantees when only local information 
is used. 

Another variation on quorum systems, probabilistic quorum
systems [MRW97, MRWW98], stands to benefit from
properly designed message diffusion in different ways than 
above. Probabilistic quorum systems are a means for gain- 
ing dramatically in performance and resilience over tradi- 
tional (strict) quorum systems by allowing a marginal, con- 
trollable probability of inconsistency for data reads. When 
coupled with an effective diffusion technique, the probabil- 
ity of inconsistency can be driven toward zero when updates 
are sufficiently dispersed in time. 

More generally, diffusion is a fundamental mechanism 
for driving replicated data to a consistent state in a highly 
decentralized system. Our study sheds light on the use of 
diffusion protocols in systems where arbitrary failures are a 
concern, and may form a basis of solutions for disseminat- 
ing critical information in survivable systems (e.g., routing 
table updates in a survivable network architecture). 

1.2 Related work 

The style of update diffusion studied here has previously
been studied in systems that can suffer benign failures
only. Notably, Demers et al. [DGH+87] performed a detailed
study of epidemic algorithms for the benign setting,
in which each update is initially known at a single replica
and must be diffused to all replicas with minimal traffic
overhead. One of the algorithms they studied, called anti-entropy
and apparently initially proposed in [BLNS82], was
adopted in Xerox's Clearinghouse project (see [DGH+87])
and the Ensemble system [BHO+98]. Similar ideas also underlie
IP-Multicast [Dee89] and MUSE (for USENET News
propagation) [LOM94]. This anti-entropy technique forms
the basis for one of the algorithms (Random) that we study 
here. As described previously, however, the analysis pro- 
vided here of the epidemic-style update diffusion is funda- 
mentally different for Byzantine environments than for en- 
vironments that suffer benign failures only. 

Prior studies of update diffusion in distributed systems 
that can suffer Byzantine failures have focused on single-source
broadcast protocols that provide reliable communication
to replicas and replica agreement on the broadcast
value (e.g., [LSP82, DS83, BT85, MR96]), sometimes
with additional ordering guarantees on the delivery of updates
from different sources (e.g., [Rei94, CASD95, MM95,
KMM98]). The problem that we consider here is different
from these works in the following ways. First, in these
prior works, it is assumed that one replica begins with each
update, and that this replica may be faulty, in which case
the correct replicas can agree on an arbitrary update. In
contrast, in our scenario we assume that at least a thresh- 
old t > 1 of correct replicas begin with each update, and 
that only these updates (and no arbitrary ones) can be ac- 
cepted by correct replicas. Second, these prior works focus 
on certain reliability, i.e., guaranteeing that all correct repli- 
cas (or all correct replicas in some agreed-upon subset of 
replicas) receive the update. Our protocols diffuse each up- 
date to all correct servers only with some probability that is 
determined by the number of rounds for which the update is 
propagated before it is discarded. Our goal is to analyze the 
number of rounds until the update is expected to be diffused 
globally and the load imposed on each replica as measured 
by the number of messages it receives in each round. 

2 System model 

We assume a system of $n$ replicas, denoted $p_1, \ldots, p_n$.
A replica that conforms to its I/O and timing specifications 
is said to be correct. A faulty replica is one that deviates 
from its specification. A faulty replica can exhibit arbitrary 
behavior (Byzantine failures). We assume that strictly fewer 
than t replicas fail, where t is a globally known system pa- 
rameter. 

Replicas can communicate via a completely connected 
point-to-point network. Communication channels between 
correct replicas are reliable and authenticated, in the sense 
that a correct replica $p_i$ receives a message on the communication
channel from another correct replica $p_j$ if and only
if $p_j$ sent that message to $p_i$. Moreover, we assume that
communication channels between correct replicas impose a
bounded latency $\Delta$ on message transmission; i.e., communication
channels are synchronous. Our protocols will also
work to diffuse updates in an asynchronous system, but in 
this case we can provide no delay or fan-in analysis. Thus, 
we restrict our attention to synchronous systems here. 

Our diffusion protocols proceed in synchronous rounds. 
A system parameter, fan-out, denoted $F^{out}$, bounds from
above the number of messages any correct replica sends in a
single round. A replica receives and processes all messages
sent to it in a round, before the next round starts. Thus,
rounds begin at least $\Delta$ time units apart.

Each update $u$ is introduced into the system at a set $I_u$
of $\alpha \ge t$ correct replicas, and possibly also at some other,
faulty replicas. We assume that all replicas in $I_u$ initially
receive $u$ simultaneously (i.e., in the same round). The goal
of a diffusion protocol is to cause $u$ to be accepted at all
correct replicas in the system. The update $u$ is accepted at
correct replica $p_i$ if $p_i \in I_u$ or $p_i$ has received $u$ from $t$ other
distinct replicas. If $p_i$ has accepted $u$, then we also say that
$p_i$ is active for $u$ (and is passive otherwise). In all of our
diffusion protocols, we assume that each message contains
all the updates known to the sender, though in practice, obvious
techniques can reduce the actual number of updates
sent to necessary ones only.
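
The acceptance rule of this section can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not part of the protocols' specification: the class name Replica, its fields, and the method receive are hypothetical, and the sketch tracks a single update only.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical per-update state of one replica (illustrative only)."""
    replica_id: int
    t: int                         # acceptance threshold of the system model
    in_initial_set: bool = False   # True if the replica belongs to the initial set
    senders: set = field(default_factory=set)  # distinct replicas that sent the update

    def receive(self, sender_id: int) -> None:
        # Channels are authenticated, so sender_id is assumed to be reliable.
        # Copies from the same sender are counted once; own copies are ignored.
        if sender_id != self.replica_id:
            self.senders.add(sender_id)

    @property
    def active(self) -> bool:
        # Accepted if obtained directly from the source, or from t distinct others.
        return self.in_initial_set or len(self.senders) >= self.t
```

Because fewer than t replicas are faulty, t distinct senders always include at least one correct replica, which is why counting distinct senders suffices in this sketch.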

3 Measures 

We study two complexity measures: delay and fan-in. 
For each update, the delay is the expected number of rounds 
from the time the update is introduced to the system until 
all correct replicas accept the update. Formally, let $\eta_u$ be
the round number in which update $u$ is introduced to the
system, and let $\tau_p$ be the round in which a correct replica $p$
accepts update $u$. The delay is $E[\max_p\{\tau_p\} - \eta_u]$, where
the expectation is over the random choices of the algorithm 
and the maximization is over correct replicas. 

We define fan-in to be the expected maximum number
of messages that any correct replica receives in a single
round from correct replicas under all possible failure scenarios.
Formally, let $\rho_p^i$ be the number of messages received
in round $i$ by replica $p$ from correct replicas. Then the fan-in
in round $i$ is $E[\max_{p,C}\{\rho_p^i\}]$, where the maximum is taken
with respect to all correct replicas $p$ and all failure configurations
$C$ containing fewer than $t$ failures. An amortized
fan-in is the expected maximum number of messages received
over multiple rounds, normalized by the number of
rounds. Formally, a $k$-amortized fan-in starting at round $l$
is $E[\max_{p,C}\{\sum_{i=l}^{l+k-1} \rho_p^i / k\}]$. We emphasize that fan-in and
amortized fan-in are measures only for messages from correct
replicas. Let $F^{in}$ denote the fan-in. In a round, a correct
replica may receive messages from $F^{in} + t - 1$ different
replicas, and may receive any number of messages from
faulty replicas.
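
As an illustration of these definitions, the following Python sketch computes the single-run quantities inside the expectations from a recorded execution. The function names and the dictionary layout of the trace are assumptions made for this example; averaging over many runs would estimate the expectations, and the maximization over failure configurations is not captured here.

```python
def run_delay(intro_round, accept_rounds):
    """Rounds from introduction until the last correct replica accepts (one run).

    accept_rounds maps each correct replica to the round in which it accepted.
    """
    return max(accept_rounds.values()) - intro_round


def run_fan_in(received, round_no):
    """Largest number of correct-sender messages any replica got in one round.

    received maps each correct replica to a dict {round: message count}.
    """
    return max(counts.get(round_no, 0) for counts in received.values())


def run_amortized_fan_in(received, start, k):
    """Largest per-replica total over rounds start..start+k-1, divided by k."""
    return max(
        sum(counts.get(i, 0) for i in range(start, start + k)) / k
        for counts in received.values()
    )
```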

A possible alternative is to define fan-in as an absolute 
bound limiting the number of replicas from which each cor- 
rect replica will accept messages in each round. However, 
this would render the system vulnerable to "denial of ser- 
vice" attacks by faulty replicas: by sending many messages, 
faulty replicas could force messages from correct replicas to 
compete with up to $t - 1$ messages from faulty replicas in
every round, thus significantly changing the behavior of our 
protocols. 

4 General Results 

In this section we present general results concerning the 
delay and fan-in of any propagation algorithm. Our first 
result is a lower bound on delay that stems from the restriction
on fan-out, $F^{out}$. This lower bound is for the worst-case
delay, i.e., when faulty replicas send no messages.

Theorem 4.1 The delay of any diffusion algorithm $A$ is
$\Omega(\frac{t}{F^{out}} \log \frac{n}{\alpha})$.

Proof: Let $u$ be any update, and let $m_k$ denote the total
number of times $u$ is sent by correct processes in rounds
$\eta_u + 1, \ldots, \eta_u + k$ in $A$. Denote by $\alpha_k$ the number of
correct replicas that have accepted update $u$ by the time
round $\eta_u + k$ completes. Since $t$ copies of update $u$ need
to reach a replica (not in $I_u$) in order for it to accept the update,
we have that $\alpha_k \le \alpha + m_k/t$. Furthermore, since at
most $F^{out}\alpha_k$ new updates are sent by correct processes in
round $\eta_u + k + 1$, we have that

$$m_{k+1} \le m_k + F^{out}\alpha_k \le F^{out}\sum_{j=0}^{k}\alpha_j,$$

where $\alpha_0 = \alpha$. By induction on $k$, it
can be shown that $\alpha_k \le \alpha(1 + \frac{F^{out}}{t})^k$. Therefore, for
$k < \frac{t}{F^{out}} \log \frac{n}{\alpha}$ we have that $\alpha_k < n$, which implies that
not all the replicas are active for update $u$. □
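
To get a feel for the bound, the following Python sketch iterates the growth cap from the proof, namely that the number of active replicas after k rounds is at most alpha * (1 + F_out / t) ** k, and reports the first round at which this cap can reach n. The function name and parameters are hypothetical; the sketch only evaluates the inequality and does not simulate any protocol.

```python
import math


def rounds_lower_bound(n, alpha, t, f_out):
    """Smallest k with alpha * (1 + f_out / t) ** k >= n.

    No diffusion algorithm with fan-out f_out can have all n replicas
    active before this round, by the growth cap used in the proof.
    """
    k = 0
    cap = float(alpha)
    while cap < n:
        cap *= 1 + f_out / t   # one more round of at most geometric growth
        k += 1
    return k
```

For instance, with n = 100, alpha = 3, t = 2 and fan-out 1, the cap grows by a factor 1.5 per round, so at least 9 rounds are needed, which is consistent with the asymptotic form (t / F_out) * log(n / alpha).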

The next theorem shows that there is an inherent tradeoff 
between fan-in and delay. 

Theorem 4.2 Let $A$ be any propagation algorithm. Denote
by $D$ its delay, and by $F^{in}$ its $D$-amortized fan-in. Then
$D F^{in} = \Omega(tn/\alpha)$, for $t \ge 2\log n$.

Proof: Let $u$ be any update. Since the $D$-amortized fan-in
of $A$ is $F^{in}$, with probability 0.9 (where 0.9 is arbitrarily
chosen here as some constant between 0 and 1), the number
of messages received (from correct replicas) by any replica
in rounds $\eta_u + 1, \ldots, \eta_u + D$ is less than $10DF^{in}$. From
now on we will assume that every replica $p_j$ receives at
most $10DF^{in}$ messages in rounds $\eta_u + 1, \ldots, \eta_u + D$. This
means that for each $p_j$, if $p_j$ is updated by a set $S_j$ of replicas
during rounds $\eta_u + 1, \ldots, \eta_u + D$, then $|S_j| \le 10DF^{in}$.
Some replica $p_j$ becomes active for $u$ if out of the updates in
$S_j$ at least $t$ are from $I_u$, i.e., $|S_j \cap I_u| \ge t$. In order to show
the lower bound, we need to exhibit an initial set $I_u$ such
that if $10DF^{in}$ is too small then no replica becomes active.
More specifically, for $D \le \frac{1}{2} \cdot \frac{tn}{10eF^{in}\alpha}$, we show that there
exists a set $I_u$ such that for each $p_j$, we have $|S_j \cap I_u| < t$.

We choose the initial set $I_u$ as a random subset of
$\{p_1, \ldots, p_n\}$ of size $\alpha$. Let $X_j$ denote the number of replicas
in $I_u$ from which messages are received by replica $p_j$
during rounds $\eta_u + 1, \ldots, \eta_u + D$, i.e., $X_j = |S_j \cap I_u|$. Since
$p_j$ receives at most $10DF^{in}$ messages in these rounds, we
get

$$\begin{aligned}
\mathrm{Prob}[X_j \ge k] &\le \sum_{i=k}^{10DF^{in}} \frac{\binom{10DF^{in}}{i}\binom{n-10DF^{in}}{\alpha-i}}{\binom{n}{\alpha}} \\
&\le \binom{10DF^{in}}{k}\left(\frac{\alpha}{n}\right)^k \\
&\le \left(\frac{10eDF^{in}\alpha}{kn}\right)^k,
\end{aligned}$$

where, for $k = t$, the base of the last expression is at most $1/2$ if $D \le \frac{1}{2} \cdot \frac{tn}{10eF^{in}\alpha}$, and
hence we have that $\mathrm{Prob}[X_j \ge t] \le (1/2)^t$. By our assumption
that $t \ge 2\log n$, we have that $\mathrm{Prob}[X_j \ge t] \le 1/n^2$.
This implies that the probability that all the $X_j$ are
less than $t$ is at least $1 - (1/n)$.



We have shown that for most subsets $I_u$, if
$D \le \frac{1}{2} \cdot \frac{tn}{10eF^{in}\alpha}$, no new replica would become active. Therefore,
for some specific $I_u$ it also holds. (In fact it holds for most.)

Recall that at the start of the proof we assumed that the
$D$-amortized fan-in is at most $10F^{in}$. This holds with probability
at least 0.9. Therefore in 0.9 of the runs the delay is
at least $\frac{1}{2} \cdot \frac{tn}{10eF^{in}\alpha}$, which implies that the expected delay is
$\Omega(\frac{tn}{F^{in}\alpha})$. □
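
The counting step of the proof can be checked numerically. The Python sketch below compares the exact hypergeometric tail for the overlap between a fixed set S of s senders (s plays the role of 10 * D * F_in) and a uniformly random initial set of size alpha with the closed-form bound (e * s * alpha / (k * n)) ** k. The function names are hypothetical and the sketch is only a sanity check of the inequality, not a replacement for the proof.

```python
from math import comb, e


def tail_exact(n, alpha, s, k):
    """Prob[X >= k], where X is the overlap between a fixed set of size s
    and a uniformly random subset of size alpha drawn from n replicas."""
    total = comb(n, alpha)
    hits = sum(comb(s, i) * comb(n - s, alpha - i)
               for i in range(k, min(s, alpha) + 1))
    return hits / total


def tail_bound(n, alpha, s, k):
    """Closed-form upper bound (e * s * alpha / (k * n)) ** k on the tail."""
    return (e * s * alpha / (k * n)) ** k
```

With n = 1000, alpha = 20, s = 50 and k = t = 10, the base of the bound is below 1/2, so the bound is below (1/2) ** 10, and the exact tail is smaller still.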

5 Random Propagation 

In this section, we present a random diffusion method 
and examine its delay and fan-in measures. In this algo- 
rithm, which we refer to as simply "Random", each replica, 
at each round, chooses $F^{out}$ replicas uniformly at random
from all replicas and sends messages to them. This
method is similar to the "anti-entropy" method of [BLNS82,
DGH+87].
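
The round structure just described can be sketched as a small Python simulation of a failure-free run for a single update. Several details are assumptions of this example rather than statements of the paper: the function name simulate_random, the choice of f_out distinct targets per round, the rule that only replicas active at the start of a round forward the update, and the placement of the initial set at indices 0..alpha-1.

```python
import random


def simulate_random(n, alpha, t, f_out, seed=0, max_rounds=10_000):
    """Failure-free run of a Random-style diffusion for one update.

    Returns the number of rounds until every replica is active,
    or None if max_rounds is exceeded.
    """
    rng = random.Random(seed)
    active = set(range(alpha))             # the initial set
    senders = [set() for _ in range(n)]    # distinct active senders seen so far
    for round_no in range(1, max_rounds + 1):
        # Only replicas that were active when the round started forward u.
        for p in sorted(active):
            for q in rng.sample(range(n), f_out):
                if q != p:
                    senders[q].add(p)
        # A passive replica becomes active after t distinct senders.
        active |= {q for q in range(n) if len(senders[q]) >= t}
        if len(active) == n:
            return round_no
    return None
```

Running the function for several seeds and values of t gives a rough picture of the two phases discussed in the introduction: a slow start while passive replicas collect their first t copies, followed by rapid growth.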



In the next theorem we use the notation

$$R_{\beta,t} = \sum_{i=0}^{t-1} \frac{\beta}{\beta - i},$$

which is the result of the analysis of the coupon collector
problem, i.e., the expected number of steps for collecting $t$
distinct 'coupons' out of $\beta$ different ones by random polling
(see [MR95, ch. 3]). It is worth discussing how $R_{\beta,t}$ behaves
for various values of $\beta$ and $t$. For $\beta = t$ we have
$R_{\beta,t} \approx t \log t$. For $\beta \ge 2t$ we have $R_{\beta,t} \le 1.5t$. For all
$\beta \ge t$, we have $R_{\beta,t} \ge t$. This implies that if the initial set
size $\beta$ is very close to $t$, then we have a slightly superlinear
behavior of $R_{\beta,t}$ as a function of $t$, while if $\beta$ is a fraction
away from $t$ then we have $R_{\beta,t}$ as a linear function in $t$.
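
The behavior described above is easy to check numerically. The Python sketch below assumes the standard coupon-collector expression, i.e., the expected number of uniform draws needed to see t distinct coupons out of beta is the sum of beta / (beta - i) for i = 0, ..., t - 1; the function name is hypothetical.

```python
def coupon_collector(beta, t):
    """Expected number of uniform draws needed to see t distinct coupons
    out of beta equally likely ones (requires 1 <= t <= beta)."""
    if not 1 <= t <= beta:
        raise ValueError("need 1 <= t <= beta")
    # After i distinct coupons, a new one appears with probability (beta - i) / beta.
    return sum(beta / (beta - i) for i in range(t))
```

For example, coupon_collector(40, 20) stays between 20 and 30, in line with the linear regime for beta at least 2t, while coupon_collector(20, 20) is noticeably larger, reflecting the t log t regime for beta = t.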

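Theorem 5.1 The delay of the Random algorithm is
$O\left(\frac{R_{\alpha,t}}{F^{out}}\left(\frac{n}{\alpha}\right)^{1 - \frac{1}{2R_{\alpha,t}}} + \frac{\log n}{F^{out}}\right)$ for $2 \le t \le n/4$.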

Proof: The outline of the proof is as follows. For the 
most part, we consider bounds on the number of messages 
sent, rather than directly on the number of rounds. It is more 
convenient to argue about the number of messages, since the 
distribution of the destination of each replica's next message 
is fixed, namely uniform over all replicas. As long as we 
know that there are between $\alpha$ and $2\alpha$ replicas active for $u$,
we can translate an upper bound on the number of messages
to an approximate upper bound on the number of rounds.

More specifically, so long as the number $\beta$ of active
replicas does not reach a quarter of the system, i.e.,
$\alpha \le \beta \le n/4$, we study $m^+(\beta)$, an upper bound on the number
of messages needed to be sent such that with high probability,
$1 - q^+(\beta)$, we have $\beta$ new replicas change state to active.
We then analyze the algorithm as composed of phases starting
with $\beta = 2^j\alpha$. The upper bound on the number of messages
to reach half the system is $\sum_{j=0}^{\ell} m^+(2^j\alpha)$, the bound
on the number of rounds is $\sum_{j=0}^{\ell} m^+(2^j\alpha)/(2^jF^{out}\alpha)$,
and the error probability is at most $\sum_{j=0}^{\ell} q^+(2^j\alpha)$, where
$\ell = \log(n/2\alpha) - 1$. In the analysis we assume for simplicity
that $n = 2^j\alpha$ for some $j$, and this implies that in the last
component we study, there are at most $n/4$ active replicas.

At the end, we consider the case where $\beta > n/4$, and
bound from above the number of rounds needed to complete
the propagation algorithm. This case adds only an additive
factor of $O((t + \log n)/F^{out})$ to the total delay.

We start with the analysis of the number of messages
required to move from $\beta$ active replicas to $2\beta$, where
$\beta \le n/4$. For any $m$, let $N_i^m$ be the number of messages that
$p_i$ received, out of the first $m$ messages, and let $S_i^m$ be the
number of distinct replicas that sent the $N_i^m$ messages. Let
$U_i^m$ be an indicator variable such that $U_i^m = 1$ if $p_i$ receives
messages from $t$ or more distinct replicas after $m$ messages
are sent, and $U_i^m = 0$ otherwise. I.e., $U_i^m = 1$ if and only if
$S_i^m \ge t$.

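We now use the coupon collector's analysis to bound the
probability that $S_i^m \ge t$ when $N_i^m$ messages are received.
A replica needs to get an expected $R_{\beta,t}$ messages before
$S_i^m \ge t$, and so with probability at most $1/2$ it would need
more than $2R_{\beta,t}$ messages to collect $t$ different messages.
For $m \le n + 2R_{\beta,t}$ we have that

$$\begin{aligned}
\mathrm{Prob}[U_i^m = 1] &\ge \mathrm{Prob}[N_i^m = 2R_{\beta,t}] \, \mathrm{Prob}[U_i^m = 1 \mid N_i^m = 2R_{\beta,t}] \\
&\ge \binom{m}{2R_{\beta,t}} \left(\frac{1}{n}\right)^{2R_{\beta,t}} \left(1 - \frac{1}{n}\right)^{m - 2R_{\beta,t}} \frac{1}{2} \\
&\ge \frac{1}{9}\left(\frac{m}{2nR_{\beta,t}}\right)^{2R_{\beta,t}}.
\end{aligned}$$

Let $U^m$ denote the number of replicas that received messages
from $t$ or more replicas after $m$ messages are sent,
i.e., $U^m = \sum_{i=\beta+1}^{n} U_i^m$, where the active replicas are
$p_1, \ldots, p_\beta$. For $\beta \le n/4$ we have

$$E[U^m] \ge (n - \beta)\,\frac{1}{9}\left(\frac{m}{2nR_{\beta,t}}\right)^{2R_{\beta,t}} \ge \frac{n}{12}\left(\frac{m}{2nR_{\beta,t}}\right)^{2R_{\beta,t}},$$

where the right inequality uses the fact that $\beta \le n/4$.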
Our aim is to analyze the distribution of $U^m$. More
specifically, we would like to find $m^+(\beta)$ such that

$$\mathrm{Prob}[U^m \ge 2\beta] \ge 1 - q^+(\beta)$$

for any $m \ge m^+(\beta)$.



Generally, the analysis is simpler when the random variables
are independent. Unfortunately, the random variables
$U_i^m$ are not independent, but using a classical result
by Hoeffding [Hoe63, Theorem 4], the dependency works
only in our favor. Namely, let $X_i^m$ be i.i.d. binary random
variables with $\mathrm{Prob}[X_i^m = 1] = \mathrm{Prob}[U_i^m = 1]$, and
$X^m = \sum_{i=\beta+1}^{n} X_i^m$. Then,

$$\mathrm{Prob}[U^m - E[U^m] \ge \gamma] \le \mathrm{Prob}[X^m - E[X^m] \ge \gamma].$$

From now on we will prove the bounds for $X^m$ and they
will apply also to $U^m$. First, using a Chernoff bound
(see [KV94]) we have that

$$\mathrm{Prob}[X^m \le E[X^m]/2] \le e^{-E[X^m]/8}.$$

For $m^+(\beta) = 2nR_{\beta,t}(24\beta/n)^{1/(2R_{\beta,t})}$, we have
$E[X^{m^+(\beta)}] \ge 2\beta$, and hence

$$\mathrm{Prob}[X^{m^+(\beta)} \le \beta] \le e^{-\beta/4} = q^+(\beta).$$

For the analysis of the Random algorithm, we view the
algorithm as running in phases so long as $\beta \le n/4$. There
will be $\ell = \log(n/2\alpha) - 1$ phases, and in each phase we
start with $\beta = 2^j\alpha$ initial replicas, for $0 \le j \le \ell$. The $j$th
phase runs for $m^+(2^j\alpha)/(F^{out}2^j\alpha)$ rounds. We say that
a phase is "good" if by the end of the phase the number
of active replicas has at least doubled. The probability that
some phase is not good is bounded by

$$\sum_{j=0}^{\ell} q^+(2^j\alpha) = \sum_{j=0}^{\ell} e^{-2^j\alpha/4},$$

which is bounded by a constant smaller than 1 for $\alpha \ge 6$. Assuming that all the phases are good, at the end
half of the replicas are active.

The number of rounds until half the system is active is at
most

$$\sum_{j=0}^{\ell} \frac{m^+(2^j\alpha)}{F^{out}2^j\alpha} = O\left(\frac{R_{\alpha,t}}{F^{out}}\left(\frac{n}{\alpha}\right)^{1 - \frac{1}{2R_{\alpha,t}}}\right),$$

where we used here the fact that $R_{\beta,t}$ is a decreasing function
in $\beta$.

We now reach the last stage of the algorithm, when
$\beta \ge n/2$. Unfortunately, there are too few passive replicas to
use the analysis above for $m^+(\beta)$, since we cannot drive the
expectation of $X^m$ any higher than $\beta$. We therefore employ
a different technique here.

We give an upper bound on the expected number of
rounds for completion at the last stage. Fix any replica $p$,
and let $V_i$ be the number of new updates that $p$
receives in round $i$. Since $t \le n/4$, we have $\beta - t \ge n/4$, and so

$$E[V_i] = (\beta - t)\frac{F^{out}}{n} \ge \frac{F^{out}}{4}.$$

Let $V^r$ denote the number of new updates received by $p$ in
$r$ rounds, hence $V^r = \sum_{i=1}^{r} V_i$. Then $E[V^r] \ge rF^{out}/4$.
Using the Chernoff bound we have

$$\mathrm{Prob}[V^r \le rF^{out}/8] \le e^{-rF^{out}/64}.$$

Let $r^+ = (8t + 128\log n)/F^{out}$. The probability that $V^{r^+}$
is less than $t$ is at most $1/n^2$. The probability that some
replica receives fewer than $t$ new updates in $r^+$ rounds is thus
less than $1/n$, and so in an expected $O((t + \log n)/F^{out})$
rounds the algorithm terminates.

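Putting the two bounds together, we have an expected
$O\left(\frac{R_{\alpha,t}}{F^{out}}\left(\frac{n}{\alpha}\right)^{1 - \frac{1}{2R_{\alpha,t}}} + \frac{\log n}{F^{out}}\right)$ number of rounds. □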



The proof of the theorem reveals that it takes the same
order of magnitude of rounds just to make $\alpha$ more replicas
active as it does to make all the replicas active. This is due
to the phenomenon that having more replicas active reduces
the time to propagate the update. This is why we have a
rapid transition from the update not having been accepted by any
new replicas to its having been accepted by all replicas.

Note that when $t = \Omega(\log n)$, then simply by sending to
replicas in a round-robin fashion, the initially active replicas
can propagate an update in $O(\frac{nt}{\alpha F^{out}})$ rounds to the rest of
the system. The Random algorithm reaches essentially the
same bound in this case. This implies that the same delay 
would have been reached if the replicas that accepted the 
update would not have participated in propagating it (and 
only the original set of replicas would do all the propagat- 
ing). Finally, note that in failure-free runs of the system, the 
upper bound proved in Theorem 5.1 is also the lower bound
on the expected delay, i.e., it is tight. 

The next theorem bounds the fan-in of the random algo- 
rithm. Recall that the fan-in measure is with respect to the 
messages sent by the correct replicas. 

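Theorem 5.2 The fan-in of the Random algorithm is
$O(F^{out} + \log n)$, and when $F^{out} \le \frac{1}{4}\log n$, it is
$O\left(\frac{F^{out} + \log n}{\log\log n - \log F^{out}}\right)$. (Note that when $F^{out} = 1$, this fan-in
is $O(\frac{\log n}{\log\log n})$.) The $(\log n)$-amortized fan-in is $O(F^{out})$.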

Proof: The probability that a replica receives $k$ messages
or more in one round is bounded by $\binom{nF^{out}}{k}(1/n)^k$, which
is bounded by $(eF^{out}/k)^k$. For $k = c(F^{out} + \log n)$, this
bound is $O(1/n^2)$, for some $c > 0$. Hence the probability
that any replica receives more than $k = c(F^{out} + \log n)$ messages
in a round is small. Therefore, the fan-in is bounded
by $O(F^{out} + \log n)$. If $F^{out} \le \frac{1}{4}\log n$, then for
$k = c\,\frac{F^{out} + \log n}{\log\log n - \log F^{out}}$ this bound is $O(1/n^2)$, for some
$c > 0$. Therefore, in this case, the fan-in is bounded by
$O\left(\frac{F^{out} + \log n}{\log\log n - \log F^{out}}\right)$.

The probability that in $\log n$ rounds a specific replica receives
more than $k = cF^{out}\log n$ messages, for a suitable constant $c$ (e.g., $c = 6$ suffices), is bounded by
$\binom{nF^{out}\log n}{k}(1/n)^k$, which is bounded by $1/n^2$. The probability
that any replica receives more than $k$
messages is bounded by $1/n$. Thus, the $(\log n)$-amortized
fan-in is at most $O(F^{out})$. □
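
The per-round tail bound used at the start of the proof can be evaluated directly. The following Python sketch finds the smallest k for which (e * F_out / k) ** k is at most 1 / n**2, which is the threshold the proof needs before taking a union bound over the n replicas. The function name is hypothetical, and the sketch evaluates the bound only; it does not measure the fan-in of an actual run.

```python
import math


def fan_in_threshold(n, f_out):
    """Smallest k for which the per-replica tail bound (e * f_out / k) ** k
    drops to 1 / n**2 or below."""
    k = 1
    while (math.e * f_out / k) ** k > 1 / n ** 2:
        k += 1
    return k
```

Comparing fan_in_threshold(n, 1) for growing n against log n / log log n gives a quick impression of how slowly the single-round fan-in grows when the fan-out is 1.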

6 Tree-Random 

The Random algorithm above is one way to propagate 
an update. Its benefit is the low fan-in per replica. In this 
section, we devise a different approach that sacrifices both 
the uniformity and the fan-in in order to optimize the de- 
lay. We start with a specific instance of our approach, called 
Tree-Random. Tree-Random is a special case of a family of 
algorithms $\ell$-Tree-Random, which we introduce later. It is
presented first to demonstrate one extremum, in terms of its 
fan-in and delay, contrasting the Random algorithm. 

We define the Tree-Random algorithm as follows. We 
partition the replicas into blocks of size $4t$, and arrange
these blocks on the nodes of a binary tree. For each replica
there are three interesting sets of replicas. The first set is the
$4t$ replicas at the root of the tree. The second and third sets
are the $4t$ replicas at the right and left children of the node that
the replica is in. The total number of interesting replicas for
each replica is at most $12t$, and we call it the candidate set of
the replica. In each round, each replica chooses $F^{out}$ replicas
from its candidate set uniformly at random and sends a
message to those replicas.
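
The construction of candidate sets can be sketched in Python as follows. The heap-style numbering of tree nodes (node v has children 2v + 1 and 2v + 2), the assignment of replica i to block i // (4t), and the function names are all assumptions of this example; the text above fixes only the block size, the binary tree shape, and the three sets that make up a candidate set.

```python
import random


def candidate_set(replica, n, t):
    """Candidate set of a replica: the root block plus the blocks at the two
    children of the replica's own node (heap-style layout, an assumption)."""
    block = 4 * t
    num_nodes = (n + block - 1) // block      # the last block may be short
    node = replica // block
    children = [c for c in (2 * node + 1, 2 * node + 2) if c < num_nodes]
    members = []
    for v in [0] + children:                  # root first, then the children
        members.extend(range(v * block, min((v + 1) * block, n)))
    return members


def choose_targets(replica, n, t, f_out, rng=random):
    """Pick up to f_out distinct targets uniformly at random from the candidate set."""
    cands = candidate_set(replica, n, t)
    return rng.sample(cands, min(f_out, len(cands)))
```

With n = 56 and t = 2 the tree has 7 nodes of 8 replicas each; a replica in the root has a candidate set of 24 = 12t replicas, while a replica in a leaf has only the 8 root replicas as candidates.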

Theorem 6.1 The delay of the Tree-Random algorithm is
$O\left(\frac{R_{\alpha,t}}{F^{out}} + \frac{\log \alpha}{F^{out}} + \frac{t}{F^{out}}\log(n/t)\right)$ for $n > 8t$.

Proof: Let $u$ be any update. We say that a node in
the tree is active for $u$ if $2t$ correct replicas (out of the $4t$
replicas in the node) are active for $u$. We start by bounding
the expected number of rounds, starting from $\eta_u$, for
the root to become active. The time until the root is active
can be bounded by the delay of the Random algorithm
with $4t + \alpha$ replicas. Since on average one of every
three messages is targeted at the root, within expected
$O(R_{\alpha,t}/F^{out} + \log(\alpha)/F^{out})$ rounds the root becomes active.

The next step of the proof is to bound how much time
it takes from when a node becomes active until its child
becomes active. We will not be interested in the expected
time, but rather focus on the time until there is at least a
constant probability that the child is active, and show a bound
of $O(t/F^{out})$ rounds.

Given that $2t$ correct replicas in the parent node are active,
each replica in the child node expects to receive
$F^{out}/12$ updates from new replicas in every round.
Using a Chernoff bound, this implies that in $\ell = 96t/F^{out}$
rounds each replica has a probability of at most $e^{-t}$ of not
becoming active. The probability that the child node is not active
(i.e., fewer than $2t$ of its replicas are active) after $\ell$ rounds is
bounded by $b = 3te^{-t} < 5/6$ for $t > 2$.
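The Chernoff step can be made explicit. The derivation below is a sketch under assumptions not stated in the text: it uses the standard multiplicative lower-tail bound $\Pr[X < (1-\delta)\mu] \le e^{-\mu\delta^2/2}$, treats arrivals in different rounds as independent, and takes the meta-round length to be $96t/F^{out}$ with a per-round expectation of $F^{out}/12$. Here $X$ is the number of updates from new replicas that a fixed child replica receives during the meta-round; the authors' own constants may have been derived differently.

```latex
\begin{align*}
\mu &= \frac{96t}{F^{out}} \cdot \frac{F^{out}}{12} = 8t, \\
\Pr[X < t] &= \Pr\left[X < \left(1 - \tfrac{7}{8}\right)\mu\right]
  \le \exp\left(-\frac{\mu\,(7/8)^2}{2}\right)
  = \exp\left(-\frac{49t}{16}\right) \le e^{-t}.
\end{align*}
```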

In order to bound the delay we consider the delay until
a leaf node becomes active. We show that for each
leaf node, with high probability its delay is bounded by
$O(t\log(n/t))$. Each leaf node has $\log(n/4t)$ nodes on the
path leading from the root to it. Partition the rounds into
meta-rounds, each containing $\ell$ rounds. For each meta-round
there is a probability of at least $1 - b$ that another
node on the path becomes active. This implies that
in $k$ meta-rounds, we have an expected number of $(1 - b)k$
active nodes on the path. Therefore, the probability that
we have fewer than $(1 - b)k/2$ is at most $e^{-(1-b)k/8}$. Since we
have $\log(n/4t)$ nodes on the path, this gives the constraint
$k > 2\log(n/4t)/(1 - b)$. In addition, we would like
the probability that there exists a leaf node that does not
become active to be less than $(t/n)^2$, which holds for
$k > 16\log(n/4t)/(1 - b)$. Consider $k = 16\log(n/4t)/(1 - b)$
meta-rounds. Since there are at most $n/4t$ leaves in the tree,
with probability at least $1 - 4t/n > 1/2$ the number of
meta-rounds is at most $k = O(\log(n/t))$. Thus, the delay
is $k\ell = O(t\log(n/t)/F^{out})$. This implies that the total
expected delay is bounded by
$O(R_{\alpha,t}/F^{out} + \log(\alpha)/F^{out} + t\log(n/t)/F^{out})$. □
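To get a feel for the constants in the tree-propagation phase, the quantities in the proof can be evaluated numerically. The helper below is only a sketch: the function name is hypothetical, it takes the per-meta-round failure bound to be b = 3t·e^(-t) and the meta-round length to be 96t/F_out, and it assumes base-2 logarithms (the tree depth), which the text does not specify. It is meaningful only for t >= 2 and n > 8t.

```python
import math

def tree_phase_rounds(n, t, f_out):
    """Return (k, ell, total_rounds, tail) for the tree-propagation phase.

    k     : number of meta-rounds, 16*log2(n/4t)/(1-b), rounded up
    ell   : rounds per meta-round, 96t/f_out, rounded up
    tail  : Chernoff tail exp(-(1-b)*k/8) for a single root-to-leaf path
    """
    b = 3 * t * math.exp(-t)            # assumed per-meta-round failure bound
    depth = math.log2(n / (4 * t))      # nodes on a root-to-leaf path
    k = math.ceil(16 * depth / (1 - b))
    ell = math.ceil(96 * t / f_out)
    tail = math.exp(-(1 - b) * k / 8)
    return k, ell, k * ell, tail

k, ell, total, tail = tree_phase_rounds(n=2**20, t=16, f_out=4)
print(k, ell, total, tail)
```

The large round counts this produces reflect the unoptimized constants in the analysis rather than the delay one would expect to observe.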

Two points about this theorem are worth noting.
First, we did not attempt to optimize for the best constants.
In fact, we note that much of the constant factor in the
Tree-Random propagation delay can be eliminated if we modify
the algorithm to propagate messages deterministically down
the tree (but continue selecting targets at random from the
root node).

Second, the Tree-Random algorithm gains its speed at 
the expense of a large fan-in. The replicas at the root of 
the tree receive $O(n)$ messages in each round of the protocol,
and therefore, in practice, constitute a centralized bottleneck.
Theorem [O] shows that in our model there is an
inherent tradeoff between the fan-in and the delay. 

The next theorem claims a bound on the fan-in of the 
Tree-Random algorithm. 

Theorem 6.2 The fan-in of the Tree-Random algorithm is
$\Theta(nF^{out}/t)$, for $n = \Omega\left(\frac{t}{F^{out}}\log n\right)$.

Proof: Any replica at the root has a probability of
$F^{out}/(12t)$ of receiving a message from any other replica.
This implies that the expected number of messages per
round is $nF^{out}/(12t)$, which establishes the lower bound.
The probability that a replica receives more than $2nF^{out}/(12t)$
messages is bounded by $e^{-F^{out}n/(3 \cdot 12t)}$ (using the Chernoff bound).
Since $n = \Omega\left(\frac{t}{F^{out}}\log n\right)$, the probability is bounded by
$1/n^2$, and the theorem follows. □
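The fan-in bound can be sanity-checked numerically. The sketch below is a simplification, not the paper's simulation: it assumes every one of the n replicas independently hits a fixed root replica with probability F_out/(12t) in a round, so the fan-in is binomial, and it uses the Chernoff form exp(-mean/3) for the probability of exceeding twice the mean. All function names are hypothetical.

```python
import math
import random

def expected_root_fan_in(n, t, f_out):
    """Mean number of messages a root replica receives per round."""
    return n * f_out / (12 * t)

def chernoff_upper_tail(n, t, f_out):
    """Bound on P[fan-in > 2 * mean], using exp(-mean / 3)."""
    return math.exp(-expected_root_fan_in(n, t, f_out) / 3)

def sample_root_fan_in(n, t, f_out, rng):
    """One simulated round: each replica hits the root replica w.p. f_out/(12t)."""
    p = f_out / (12 * t)
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(0)
print(expected_root_fan_in(10_000, 4, 2))
print(sample_root_fan_in(10_000, 4, 2, rng))
print(chernoff_upper_tail(10_000, 4, 2) < 1 / 10_000**2)
```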

We now define and analyze the generalized $\ell$-Tree-Random
method. We partition the replicas into blocks of
size $\ell$, and arrange these blocks on the nodes of a binary
tree. As in the Tree-Random algorithm, for each replica
there are three interesting sets of replicas. The first set is
the $\ell$ replicas at the root of the tree. The second and third
sets are the $\ell$ replicas at the right and left sons of the node
that the replica is in. The total number of replicas in the
three sets is at most $3\ell$, and we call it the candidate set of
the replica. In each round, each replica chooses $F^{out}$ replicas
from its candidate set uniformly at random and sends a
message to those replicas.

Note that Tree-Random propagation is simply the setting
$\ell = 4t$, and Random propagation is simply the setting $\ell = n$.

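Theorem 6.3 The $\ell$-Tree-Random algorithm has delay

$$O\left(\frac{R_{\alpha,t}}{F^{out}}\left(\frac{\ell+\alpha}{\alpha}\right)^{1-1/t} + \frac{\log(\ell+\alpha)}{F^{out}} + \frac{t}{F^{out}}\log(n/\ell)\right)$$

and fan-in $\Theta(nF^{out}/\ell)$, for $4t \le \ell \le nF^{out}/\log n$.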

Proof: The proof of the fan-in is identical to the one for
the Tree-Random algorithm. We have $\ell$ replicas at the root.
Each replica sends to each replica at the root with probability
$F^{out}/3\ell$. Therefore the expected number of updates to
each replica in the root is $nF^{out}/3\ell$, which establishes the
lower bound on fan-in. Except with probability
$e^{-nF^{out}/(3 \cdot 3\ell)} < 1/n^2$, each replica receives fewer than
$2nF^{out}/3\ell$ updates in a round.

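The proof of the delay has two parts. The first is
computing the time it takes to make all the replicas in
the root active. This can be bounded by the delay of
the Random algorithm with $\ell + \alpha$ replicas, and so is
$O\left(\frac{R_{\alpha,t}}{F^{out}}\left(\frac{\ell+\alpha}{\alpha}\right)^{1-1/t} + \frac{\log(\ell+\alpha)}{F^{out}}\right)$.
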
The second part is propagating on the tree. This part is
similar to the Tree-Random algorithm. As before, in each node at
each round, each replica has a constant probability of
receiving messages from $\Omega(F^{out})$ new replicas. This implies
that with some constant probability $1 - b$ all the replicas
in a node are active after $O(t/F^{out})$ rounds. The analysis
of the propagation to a leaf node is identical to before, and
thus this second stage takes $O(\log(n/\ell))$ meta-rounds and
the total delay of the second stage is $O\left(\frac{t}{F^{out}}\log(n/\ell)\right)$. □



7 Discussion 

Our results for the Random and $\ell$-Tree-Random algorithms
are summarized in Table



Using the fan-in/delay bound of Theorem i.2, we now 
examine our diffusion methods. The Random algorithm has
$O(\log n)$-amortized fan-in of $O(F^{out})$, yielding a product
of delay and amortized fan-in of
$O\left(t\left(\frac{n}{\alpha}\right)^{1-1/t} + \log(n)\right)$
when $\alpha > 2t$. This is slightly inferior to the lower bound
in the range of t for which the lower bound applies. The 
Tree-Random method has fan-in (and amortized fan-in)
of $O(nF^{out}/t)$ and delay
$O\left(\frac{\log(\alpha)}{F^{out}} + \frac{t}{F^{out}}\log(n/t)\right)$ if
$\alpha > 2t$. So, their product is
$O\left(\frac{n\log(\alpha)}{t} + n\log(n/t)\right)$,
which again is inferior to the lower bound of $\Omega(tn/\alpha)$ since
$t/\alpha < 1$. However, recall from Theorem AA that the
delay is always $\Omega\left(\frac{t}{F^{out}}\log\left(\frac{n}{\alpha}\right)\right)$, and so for the fan-in of
$O(nF^{out}/t)$ it is impossible to achieve the optimal delay/fan-in
tradeoff. In the general $\ell$-Tree-Random method, putting
$\ell > \alpha\log(n/\alpha)$, the $\ell$-Tree-Random algorithm exhibits a
fan-in/delay product of at most $O\left(\frac{tn}{\alpha}\right)$, which is optimal. If
$\ell < \alpha\log(n/\alpha)$, the product is within a logarithmic
factor of optimal. Hence, Tree propagation provides a
spectrum of protocols that have optimal delay/fan-in tradeoff to
within a logarithmic factor.
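The role of the threshold $\alpha\log(n/\alpha)$ can be illustrated with a small calculation. The sketch below looks only at the tree-propagation term of the delay and drops all constants: it assumes fan-in proportional to n·F_out/ℓ and tree-phase delay proportional to (t/F_out)·log(n/ℓ), so their product divided by t·n/α is α·log(n/ℓ)/ℓ. The function name is hypothetical, natural logarithms are an arbitrary choice, and the root-activation term is ignored.

```python
import math

def tree_term_product_ratio(n, alpha, ell):
    """(fan-in * tree-phase delay) / (t * n / alpha), constants dropped.

    fan-in ~ n*F/ell and delay ~ (t/F)*log(n/ell); F and t cancel out.
    """
    return alpha * math.log(n / ell) / ell

n, alpha = 10**6, 100
threshold = alpha * math.log(n / alpha)
for ell in (400, 1000, 10_000):
    ratio = tree_term_product_ratio(n, alpha, ell)
    print(ell, ell >= threshold, round(ratio, 3))
```

Above the threshold the ratio stays at or below 1, and below it the ratio is bounded by log(n/α), matching the "within a logarithmic factor" statement for this term.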

Our lower bound of $\Omega\left(\frac{t}{F^{out}}\log\frac{n}{\alpha}\right)$ on the delay of any
diffusion protocol says that we pay a high price for Byzantine
fault tolerance: when $t$ is large, diffusion in our model
is (necessarily) slower than diffusion in system models
admitting only benign failures. By comparison, in systems
admitting only benign failures there are known algorithms
for diffusing updates with $O(\log n)$ delay, including one on
which the Random algorithm studied here is based [Pit87].



8 Simulation Results 

Figure 2 depicts simulation results of the Random and
Tree-Random algorithms. The figure portrays the delay of
the two methods for varying system sizes (on a logarithmic
scale), where $t$ was fixed to be 16. In part (a) of this figure,
we took the size $\alpha$ of the initial set $I_u$ to be $\alpha = t + 1$. This
graph clearly demonstrates the benefit of the Tree-Random
method in these settings, especially for large system sizes.
In fact, we had to draw the upper half of the y-axis scale
in this graph disproportionately in order for the small delay
numbers of Tree-Random, compared with the large delay
numbers exhibited by the Random method, to be visible.
Part (b) of the graph uses $\alpha = \sqrt{2tn}$, which reflects the
minimal initial set that we would use in the Fleet system,
which is one of the primary motivations for our study (see
Section 1.1). For such large initial sets, Random outperforms
Tree-Random for all feasible system sizes, and the
benefit of Tree-Random is only of theoretical interest (e.g.,
$n > 1000000$).

Figure 2. Delay of Random and Tree-Random algorithms.

In realistic large area networks, it is unlikely that 100% 
of the messages arrive in the round they were sent in, even 
for a fairly large inter-round period. In addition, it may be 
desirable to set the inter-round delay reasonably low, at the 
expense of letting some messages arrive late. Some mes- 
sages may be dropped in realistic scenarios, and hence, to 
accommodate such failures, we also ran our simulations 
while relaxing our synchrony assumptions. In these simulations,
we allowed some threshold (up to 5%) of the messages
to arrive in later rounds than the rounds in which they were
sent, or to be omitted by the receiver. The resulting behavior
of the protocols was comparable to that in the synchronous
settings. We conclude that our protocols can be used just as
effectively in asynchronous environments in which the
inter-round delay is appropriately tuned.
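A relaxed-synchrony experiment of this kind can be approximated with a toy round-based simulation. The sketch below is not the simulator used for Figure 2: it assumes all replicas are correct, models only message omission (not late arrival), has each active replica send the update to F_out distinct replicas chosen uniformly at random per round, and activates a replica once it has heard the update from t distinct senders. The function and parameter names are hypothetical.

```python
import random

def random_diffusion_rounds(n, t, alpha, f_out, loss, rng, max_rounds=100_000):
    """Rounds until all n replicas are active, with per-message loss probability."""
    active = set(range(alpha))              # initial set I_u, assumed all correct
    heard_from = [set() for _ in range(n)]  # distinct senders seen by each replica
    rounds = 0
    while len(active) < n and rounds < max_rounds:
        rounds += 1
        for sender in active:
            for target in rng.sample(range(n), f_out):
                if target in active or rng.random() < loss:
                    continue                # already active, or message dropped
                heard_from[target].add(sender)
        # Replicas that reached the threshold become active for the next round.
        newly_active = [r for r in range(n)
                        if r not in active and len(heard_from[r]) >= t]
        active.update(newly_active)
    return rounds

rng = random.Random(7)
print(random_diffusion_rounds(n=60, t=2, alpha=6, f_out=3, loss=0.0, rng=rng))
print(random_diffusion_rounds(n=60, t=2, alpha=6, f_out=3, loss=0.05, rng=rng))
```

Averaging the round counts over many seeds, with and without loss, gives a rough picture of how sensitive the delay is to a small fraction of dropped messages.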

9 Conclusion 

In this paper we have provided the first analysis of 
epidemic-style update diffusion in systems that may suffer 
Byzantine component failures. We require that no spurious 
updates be accepted by correct replicas, and thus that each 
correct replica receive an update from $t$ other replicas
before accepting it, where the number of faulty replicas is less
than $t$. In this setting, we analyzed the delay and fan-in of
diffusion protocols. We proved a lower bound on the delay 
of any diffusion protocol, and a general tradeoff between 
the delay and fan-in of any diffusion protocol. We also pro- 
posed two diffusion protocols and analyzed their delay and 
fan-in. 